Abstract — One of the most challenging problems in Opportunistic
Spectrum Access (OSA) is to design a channel sensing-based protocol
for networks with multiple secondary users (SUs). Quality of Service
(QoS) requirements for SUs have significant implications for this
protocol design. In this paper, we propose a new method to find
joint policies for SUs that not only guarantee QoS requirements
but also maximize network throughput. We use a Decentralized
Partially Observable Markov Decision Process (Dec-POMDP)
to formulate interactions between SUs. A tractable approach for
Dec-POMDP is used to extract sub-optimum joint policies for large
horizons. Among these policies, the joint policy that guarantees
the QoS requirements is selected as the joint sensing strategy for
SUs. To show the efficiency of the proposed method, we consider
two SUs trying to access a two-channel primary user (PU) network
modeled by discrete Markov chains. Simulations demonstrate three
findings: 1- Optimum joint policies for large horizons can be
obtained using the proposed method. 2- There exists a joint policy
for the assumed QoS constraints. 3- Our method outperforms other
related works in terms of network throughput.

Index Terms — Decentralized Partially Observable Markov De- 
cision Process (Dec-POMDP), Dynamic Programming (DP), Qual- 
ity of Service (QoS), Opportunistic Spectrum Access (OSA). 

I. Introduction 

With the advent of new applications in wireless data
networks, bandwidth demand has increased sharply. The
majority of the usable frequency spectrum for wireless networks
has already been assigned to licensed users. In contrast
to this apparent spectrum scarcity, extensive measurements
indicate that a large portion of the licensed spectrum lies unused
__. Thus, there is intensive research into new techniques that
use the unoccupied resources efficiently __. To achieve higher
frequency reuse efficiency, SUs should dynamically access PUs'
channels. This concept is known as Opportunistic Spectrum Access
(OSA) in the literature __. In cognitive radio networks, channel
occupation can be caused by two effects __: one is the disturbance
due to PUs' activities, which can be modeled by a finite-state
Markov chain [7]; the other is the impact of other SUs'
transmissions __.

Zhao et al. considered the Partially Observable Markov
Decision Process (POMDP) framework for spectrum access
__. They used the POMDP approach to find an optimum policy
for the single-SU case. To generalize this solution to multiple SUs,
the simple Carrier Sense Multiple Access/Collision Avoidance
(CSMA/CA) protocol was employed [5]. In another related
work __, it is assumed that SUs obtain similar observations
of PUs' channels and therefore converge to the same opportunity
assessment if they employ the single-user strategy. Results
demonstrate that applying the optimal single-user strategy to the
multi-user setting causes significant degradation in network
throughput __. To overcome this problem, it was shown that the
network performance can be improved by using a randomized policy
selection. However, this policy does not guarantee QoS
requirements among competing SUs. Besides, the SUs' collision
history is not considered in deriving the belief vector. In another
related work, Liu et al. considered two interfering SUs in a
two-channel primary network [10]. Each SU observes a different
spectrum opportunity on each channel because of the assumed
structure of the PUs' network. They proposed a myopic policy for
SUs in which both SUs exchange their beliefs in each time slot.
Simulations illustrate that this myopic policy achieves
near-optimal network throughput.

Considering time-invariant spectrum opportunities, several
works have used game theory [11], [12]. Recently,
Fu and van der Schaar utilized stochastic games to present a
solution for dynamic interaction among competing SUs [6]. This
model requires a Central Spectrum Moderator (CSM) whose task is
to announce the state of all channels to the SUs in each time
slot. However, having a centralized moderator is not practical in
some cases, and SUs cannot sense all channels in the limited time
of a single slot. In __, Pham et al. proposed a game-theoretic
approach to QoS-aware channel selection for SUs which maximizes
network throughput. They assumed that SUs know the spectrum
availability before selecting the appropriate channel.

A Partially Observable Stochastic Game (POSG) is a general
framework for multi-agent decision processes [14]. In a
POSG, the state of the channel changes according to a discrete
Markov model and is partially observable to all agents. In
this framework, each agent tries to maximize its own reward
function in a repeated game. Hansen et al. proposed a Dynamic
Programming (DP) approach to solve POSGs.
As a special case of POSG, the Dec-POMDP framework was 
investigated in [14] and [13] using the DP algorithm. In
Dec-POMDP, all agents try to maximize a common reward function.
Solving a Dec-POMDP problem by the DP algorithm becomes
intractable when the horizon length of the decision process
increases. For instance, the DP algorithm runs out of memory
even for a small horizon length in a trivial example [16].
Seuken and Zilberstein developed Memory Bounded Dynamic
Programming (MBDP) to overcome the time complexity of
the existing DP algorithm [17].

Fig. 1. Cognitive Radio Network

In this paper, our goal is to design QoS-aware joint policies
for the sensing decisions of SUs in order to maximize network
throughput. It is assumed that the PUs' network is slotted and
all SUs have the same spectrum opportunities in each time slot
(see Fig. 1; SUs are in the transmission range of all PUs, and
different SUs have the same observations of a specific channel).
At first sight, to maximize the network throughput, each SU
should be assigned its own channel to exploit spectrum
opportunities, so that SUs avoid collisions with each other. This
scheme is called the partitioning strategy [10]. However, this
strategy does not guarantee QoS requirements for SUs. For instance,
when the probability of the idle state is not the same for all
channels, this strategy does not satisfy fairness. We formulate
multiple SUs' joint policies as a Dec-POMDP. In the proposed
method, the MBDP algorithm is employed to find optimum
or sub-optimum joint policies for large horizon lengths. A
joint policy which guarantees the QoS requirements is selected to
obtain the sensing strategy for SUs. To the best of our knowledge,
the problem of synchronizing SUs in the multi-user setting, in
the presence of collisions, has received little attention. The
proposed method ensures synchronization of SU transceivers. To
demonstrate the efficiency of the proposed method, we consider
two SUs trying to access a two-channel PUs' network modeled
by discrete Markov chains. It is assumed that SUs have
perfect sensing capability. Simulations yield these findings:
The MBDP algorithm obtains the optimum solution for the
mentioned scenario. This is notable because MBDP is an
approximate algorithm for Dec-POMDP and is not guaranteed to
find an optimum joint policy. Moreover, there exists a joint
policy which satisfies the QoS constraints considered in the
simulations. Finally, compared with two related works [9], [10],
the proposed method achieves higher network throughput.



This paper is organized as follows. In Section II, we give 
an overview on Dec-POMDP. We also review the MBDP 
algorithm for Dec-POMDP. In Section III, the system model 
and Dec-POMDP formulations for cognitive radio network are 
described. In Section IV, we propose our method to extract 
QoS-aware joint policies. In Section V, as an example of 
our Dec-POMDP formulation, we define a scenario with two 
SUs trying to access a two-channel PUs' network. Also, the 
numerical simulation and results are provided. Finally, the 
conclusion is presented. 

II. Definitions and Preliminaries 

In this section, we briefly review the finite-horizon Dec-POMDP
framework and the MBDP solution proposed for handling the
intractability of the DP algorithm. More details on
the DP and MBDP algorithms can be found in __, __, __.

A. Decentralized Partially Observable Markov Decision Process

A Dec-POMDP is a tuple $\langle I, S, b^0, \{A_i\}, \{O_i\}, P, R \rangle$, where
- $I$ is a finite set of SUs indexed $1, \dots, n$.
- $S$ is a finite set of states.
- $b^0 \in \Delta(S)$ represents the initial state distribution.
- $A_i$ is a finite set of actions available to SU $i$ and $A = \times_{i \in I} A_i$
is the set of joint actions, where $a = (a_1, \dots, a_n)$ denotes a
joint action.
- $O_i$ is a finite set of observations for SU $i$ and $O = \times_{i \in I} O_i$ is
the set of joint observations, where $o = (o_1, \dots, o_n)$ denotes a
joint observation.
- $P$ is the set of Markovian state transition and observation
probabilities, where $P(s', o \mid s, a)$ denotes the probability that
choosing joint action $a$ in state $s$ yields a transition to the
state $s'$ and the joint observation $o$.
- $R : S \times A \to \mathbb{R}$ is a reward function which depends on the joint
action and the current state.

A Dec-POMDP may be defined over a finite or infinite sequence
of stages. In this paper, we focus on the finite-horizon
Dec-POMDP. At each stage, all SUs simultaneously select an action
and receive an observation. The reward for the SUs is computed
from their joint action and the state of the channels. The goal is
to maximize the expected sum of rewards over the horizon $T$:

$$E\left[\sum_{t=1}^{T} R(s^t, a^t)\right] \qquad (1)$$

where $s^t$ and $a^t$ denote the state and the joint action at stage $t$.

B. Memory Bounded Dynamic Programming for Dec-POMDP 

Solving a Dec-POMDP means finding a joint policy that
maximizes the expected total reward. A policy for a single
agent $i$ can be represented by a decision tree $q_i$, where nodes
are labeled with actions and arcs are labeled with observations
(a so-called policy tree). If $Q_i^t$ denotes a set of horizon-$t$
policy trees for agent $i$, a solution to a Dec-POMDP with horizon
$t$ can then be seen as a vector of horizon-$t$ policy trees
$\delta^t = (q_1^t, \dots, q_n^t)$ (a so-called joint policy tree), where
$q_i^t \in Q_i^t$. These policy trees can be constructed with two
different approaches: top-down or bottom-up.

TABLE I
The pseudocode for the MBDP Algorithm [17]

The MBDP Algorithm

begin
    maxTree ← max number of trees before backup
    T ← horizon of the Dec-POMDP
    B ← pre-compute relevant beliefs for each horizon
    Q_i^1, Q_j^1 ← initialize 1-step policy trees for each SU
    for t = 1 to T − 1 do
        Q_i^{t+1}, Q_j^{t+1} ← Backup(Q_i^t), Backup(Q_j^t)
        Sel_i^{t+1}, Sel_j^{t+1} ← empty
        for k = 1 to maxTree do
            choose relevant belief (b ∈ B) for horizon t
            for each q_i ∈ Q_i^{t+1}, q_j ∈ Q_j^{t+1} do
                evaluate each pair (q_i, q_j) with respect to b
            end
            add best policy trees to Sel_i^{t+1} and Sel_j^{t+1}
            delete these policy trees from Q_i^{t+1} and Q_j^{t+1}
        end
        Q_i^{t+1}, Q_j^{t+1} ← Sel_i^{t+1}, Sel_j^{t+1}
    end
    select best joint policy tree δ^T from {Q_i^T, Q_j^T}
    return δ^T
end
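The policy-tree representation described above can be sketched in a few lines of Python. This is only an illustrative data structure, not the authors' implementation: the class name `PolicyTree`, the helper `follow`, and the action and observation labels (`"L1"`, `"L2"`, `"F"`, `"U"`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PolicyTree:
    """A node of a policy tree: an action plus one subtree per observation."""
    action: str
    children: Dict[str, "PolicyTree"] = field(default_factory=dict)

    def depth(self) -> int:
        # A leaf has depth 1; otherwise one more than the deepest child.
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children.values())


def follow(tree: PolicyTree, observations: List[str]) -> List[str]:
    """Return the actions taken when the observations arrive in the given order."""
    actions = [tree.action]
    node = tree
    for obs in observations:
        node = node.children[obs]
        actions.append(node.action)
    return actions


# Hypothetical depth-2 trees: each SU switches channel only after a busy observation.
leaf_l1 = PolicyTree("L1")
leaf_l2 = PolicyTree("L2")
tree_su1 = PolicyTree("L1", {"F": leaf_l1, "U": leaf_l2})
tree_su2 = PolicyTree("L2", {"F": leaf_l2, "U": leaf_l1})
joint_policy = (tree_su1, tree_su2)  # a joint policy tree is one tree per SU
```

A joint policy tree is then simply a tuple with one such tree per SU; each SU walks its own tree using only its local observations.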

The first algorithm for solving Dec-POMDPs used a bottom-up
approach [14]. Policy trees are constructed incrementally:
the algorithm starts at the frontiers and works its way up to
the roots using the DP algorithm. The DP algorithm updates in
two steps. In the first step, the DP operator is given a set
$Q^t$ of depth-$t$ policy trees. A set of depth-$(t+1)$ policy trees,
$Q^{t+1}$, is generated by considering every depth-$(t+1)$ policy
tree that makes a transition, after an action and observation,
to the root node of a depth-$t$ policy tree in $Q^t$. This
step is called exhaustive backup [14]. In exhaustive backup,
$|A|\,|Q^t|^{|O|}$ depth-$(t+1)$ policy trees are created. It is clear
that the total number of constructed trees in each step increases
exponentially. To alleviate this problem, unnecessary trees are
pruned [14]. However, this modified DP algorithm runs out of
memory even for simple problems [17]. One drawback of the
pruning process is that it cannot predict which beliefs about
the state and about the other SUs' policies will eventually be
useful before reaching the root of the policy trees. The MBDP
algorithm combines the bottom-up and top-down approaches.
Using top-down heuristics to identify relevant belief states,
the DP algorithm compares the bottom-up policy trees and selects
the best joint policy. Some top-down heuristic policies are
proposed in [17]. The state of a channel is not
affected by the actions of SUs. Therefore, we can easily compute
the most probable beliefs using the initial belief and the Markov
models of the channels. The MBDP algorithm used in this
paper is shown in Table I. The algorithm is written for two SUs
$i$ and $j$; it can be extended to an arbitrary number of SUs. The
parameter maxTree denotes the number of policy trees that
are used in exhaustive backup for constructing the next stage. In
other words, the size of the set Backup($Q_i^t$) is $|A|\,\mathit{maxTree}^{|O|}$.
To evaluate each pair with respect to the belief vector $b$,
the concept of the value vector in POMDPs is employed [18]. The
expected sum of rewards with respect to the belief $b$, $V^t(b)$, is
computed as the dot product of the value vector and the assumed belief:

$$V^t(b) = \sum_{s \in S} b(s)\, v^t(s) \qquad (2)$$

where $v^t$ is an $|S|$-dimensional vector. For a depth-$(t+1)$ joint
policy tree $\delta^{t+1} = (q_i^{t+1}, q_j^{t+1})$, the value vector is:

$$v^{t+1}(s) = R(s, a) + \sum_{s' \in S} \sum_{o \in O} P(s', o \mid s, a)\, v^{t}\big(s', \delta^{t+1}(o)\big) \qquad (3)$$

where $a$ is the joint action at the roots of $\delta^{t+1}$, $\delta^{t+1}(o)$
is the joint policy of subtrees selected by the SUs after the
observation vector $o$, and $v^t(s', \delta^{t+1}(o))$ is the value of that
joint subtree policy in state $s'$. In [17], it is proved that the MBDP
algorithm has linear time complexity with respect to the
horizon length.
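To make the contrast between exhaustive backup and the memory-bounded variant concrete, the following sketch counts the trees generated per agent at each depth. It is an illustration only: the function names are hypothetical, and the sizes used (2 actions, 3 observations per SU) are assumed for the example rather than taken from the algorithm description.

```python
def exhaustive_backup_size(num_actions: int, num_trees: int, num_obs: int) -> int:
    """Trees built from num_trees depth-t trees in one backup: |A| * |Q^t|^|O|."""
    return num_actions * num_trees ** num_obs


def tree_counts(num_actions: int, num_obs: int, horizon: int, max_trees=None):
    """Per-agent number of candidate trees generated at each depth 1..horizon.

    With max_trees set, only max_trees trees survive each step (the MBDP
    selection), so every backup starts from a bounded set.
    """
    kept = num_actions  # depth-1 trees: one per action
    counts = [kept]
    for _ in range(horizon - 1):
        generated = exhaustive_backup_size(num_actions, kept, num_obs)
        counts.append(generated)
        kept = generated if max_trees is None else min(generated, max_trees)
    return counts


# Assumed sizes for illustration: 2 actions and 3 observations per SU.
unbounded = tree_counts(num_actions=2, num_obs=3, horizon=4)
bounded = tree_counts(num_actions=2, num_obs=3, horizon=4, max_trees=3)
```

Under these assumed sizes, the unbounded count reaches about 10^12 trees at depth 4, while the bounded variant never generates more than 2 · 3^3 = 54 trees per step, which is why the work per stage stays constant as the horizon grows.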

III. Problem Definition 

A. System Model 

Our model consists of: i) a spectrum with $C$ channels
assigned to PUs; ii) $N$ PUs and $M$ SUs. It is assumed that
all PUs and SUs communicate in a synchronous slot structure
[5] and all SUs have the same spectrum opportunities in each time
slot (see Fig. 1). Each SU uses the beginning of each time
slot to sense one of the $C$ channels. Based on the obtained
observation, an SU can choose either to transmit on one of the
$C$ channels or not to transmit at all. At the end of the time
slot, each SU learns from the ACK of its corresponding receiver
whether the transmission was successful (see Fig. 2).

B. Dec-POMDP Formulation 

Fig. 2. A time slot

For each SU, we have the following set of actions:

$$A_i = \{L_{i,1}, \dots, L_{i,C},\ T_{i,1}, \dots, T_{i,C},\ D_i\} \qquad (4)$$

where $i = 1, \dots, M$. $L_{i,j}$ represents the action of sensing
channel $j$ for SU $i$. The $L_{i,j}$ are only used in the sensing level, as
shown in Fig. 3. $T_{i,j}$ denotes the action of accessing channel
$j$ by SU $i$. It should be noted that the $T_{i,j}$ are only used in the
transmission level, as depicted in Fig. 3. $D_i$ denotes the action
in which SU $i$ does not send its data and stays silent during the
current time slot. Each channel is assumed to follow a two-state
Markov model and to be independent of the other channels.
Thus, we have:

$$S_i = \{0, 1\} \qquad (5)$$

where $S_i$ is the set of states for channel $i$. Channel $i$ is
available for SUs if $S_i = 1$. The state transition probability
of the Markov model of channel $i$ is denoted by $p^i_{jk}$, where
$j, k \in \{0, 1\}$. For SU $i$, the following set of observations is
considered:

$$O_i = \{U_i, F_i, CO_i, NC_i\} \qquad (6)$$

where $U_i$ indicates that the channel sensed by SU $i$ is occupied by
PUs and $F_i$ indicates that it is idle after sensing. In addition,
$CO_i$ indicates that there is a collision after transmission on the
channel and $NC_i$ denotes that there is no collision after action
$T_i$. $CO_i$ and $NC_i$ are obtained through the ACK signal sent by the
receiver. $U_i$ and $F_i$ are observations in the sensing level, while
$CO_i$ and $NC_i$ are observations in the transmission level.

We define $R_i(t)$, the reward function for SU $i$ in time slot
$t$, as follows:

$$R_i(t) = \begin{cases} 0 & \text{if SU } i \text{ does not send or receives no ACK} \\ 1 & \text{if SU } i \text{ receives an ACK} \end{cases} \qquad (7)$$

It is clear that the reward function $R_i(t)$ depends on the joint
actions of the SUs and the states of the channels. The reward
function of the Dec-POMDP formulation in time slot $t$ is obtained
from the sum of all SUs' rewards:

$$R(t) = \sum_{i=1}^{M} R_i(t) \qquad (8)$$
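The per-slot reward can be sketched as follows. This is a minimal illustration under a stated assumption: an SU is taken to receive an ACK only when it transmits on a channel that is idle and no other SU transmits on the same channel in that slot. The function names and the encoding of actions (a channel index, or `None` for staying silent) are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence


def su_rewards(channel_idle: Sequence[int],
               tx_channel: Sequence[Optional[int]]) -> List[int]:
    """Per-SU reward R_i(t): 1 if SU i receives an ACK, else 0.

    channel_idle[c] is 1 when channel c is free of PU activity.
    tx_channel[i] is the channel SU i transmits on, or None if it stays silent.
    Assumption: an ACK arrives only if the channel is idle and no other SU
    transmits on it in the same slot.
    """
    load = Counter(c for c in tx_channel if c is not None)
    return [
        1 if c is not None and channel_idle[c] == 1 and load[c] == 1 else 0
        for c in tx_channel
    ]


def network_reward(channel_idle: Sequence[int],
                   tx_channel: Sequence[Optional[int]]) -> int:
    """Dec-POMDP reward R(t): the sum of all SUs' rewards."""
    return sum(su_rewards(channel_idle, tx_channel))
```

For example, two SUs transmitting on two different idle channels earn a network reward of 2, whereas two SUs colliding on the same idle channel earn 0.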



IV. QoS-Aware Joint Policies

In this section, we propose a method to find QoS-aware joint
policies. The proposed method consists of two phases: 1- First
Phase: In this phase, a set of optimum or sub-optimum joint
policies is obtained using the MBDP algorithm. 2- Second
Phase: Among the joint policies found in the first phase, the one
which satisfies the QoS constraints is selected.

A. First Phase

A set of optimum or sub-optimum joint policies is found
in this phase by the MBDP algorithm. The MBDP algorithm
has three inputs (see Fig. 4): 1- the time horizon ($T$): the duration
of the decision process. 2- the initial belief state ($b^0$): the initial
probability distribution of the PUs' channel occupation. 3- the set of
transition probabilities of the channels: the transition probabilities
of the Markov chains (the $p^i_{jk}$). The MBDP algorithm is executed
for a specified number of trials.

B. Second Phase 

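In this phase, the joint policy that meets the QoS constraints is
selected. We define the QoS constraints in terms of the expected
throughput of the SUs, as given by the following equation:

$$\frac{R_1^T}{\alpha_1} = \frac{R_2^T}{\alpha_2} = \cdots = \frac{R_M^T}{\alpha_M} \qquad (9)$$

where $R_i^T$ is:

$$R_i^T = E\left[\sum_{t=1}^{T} R_i(t)\right] \qquad (10)$$

and $\{\alpha_i \mid i \in \{1, \dots, M\}\}$ is a set of QoS parameters.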

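Definition- We say that the joint policy $\delta^T$ satisfies the QoS
constraints if the $R_i^T$ meet the following inequalities for an
assumed parameter $\zeta$:

$$\forall i \in \{1, \dots, M\},\ \exists\, \xi \in \left[0, \frac{R_{\max}}{\sum_{i=1}^{M} \alpha_i}\right] : \quad \left|R_i^T - \alpha_i \xi\right| < \zeta \qquad (11)$$

where $R_{\max}$ is the maximum achievable throughput of the
cognitive radio network.

Fig. 3. Sets of depth-$(t+1)$ policy trees, $Q_i^{t+1}$, for the $i$-th SU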
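The second-phase selection test can be sketched as an interval intersection. The sketch below assumes the QoS definition is read as requiring a single common scale, here called `xi`, that lies between 0 and R_max divided by the sum of the QoS parameters and keeps every SU's expected throughput within `zeta` of `alpha_i * xi`. The function name, argument names, and the choice of `r_max = 4.0` for a horizon of 4 are assumptions made for illustration.

```python
from typing import Sequence


def satisfies_qos(throughputs: Sequence[float], alphas: Sequence[float],
                  zeta: float, r_max: float) -> bool:
    """True if one xi in [0, r_max / sum(alphas)] gives |R_i - alpha_i*xi| < zeta for all i."""
    lo, hi = 0.0, r_max / sum(alphas)
    for r, a in zip(throughputs, alphas):
        # |r - a*xi| < zeta  <=>  (r - zeta)/a < xi < (r + zeta)/a   (for a > 0)
        lo = max(lo, (r - zeta) / a)
        hi = min(hi, (r + zeta) / a)
    return lo < hi


# Throughput pair reported for T = 4 under a 2:1 QoS ratio, checked against
# the 2:1 target and against an equal-share target (r_max = 4.0 is assumed).
ok_ratio_2 = satisfies_qos([2.66, 1.34], alphas=[2, 1], zeta=0.25, r_max=4.0)
ok_ratio_1 = satisfies_qos([2.66, 1.34], alphas=[1, 1], zeta=0.25, r_max=4.0)
```

Each constraint restricts `xi` to an open interval, so a joint policy qualifies exactly when all these intervals and the admissible range share a point.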

C. Notes on Implementation Issues 

For different QoS constraints, optimum or sub-optimum
joint policy trees are computed off-line and stored in the
SUs' memory. Each joint policy tree has an identity number.
Before the SUs start sending, the initial belief is set to the
steady-state distribution of the channels. Moreover, SUs can determine
the initial belief precisely if they are allowed to sense all channels
in the first slot. For the predefined horizon length ($T$) and the
assumed initial belief, SUs select from their memory an appropriate
joint policy tree which guarantees the QoS constraints.
Afterwards, SUs send the identity number of the selected joint
policy to their corresponding receivers in the first slot. The
receiver tracks the transmitter's sensing action from the decision
tree and observes the channel which the transmitter is currently
sensing. Because of the perfect sensing capability and the same
spectrum opportunities for all SUs, the SU transceivers are
synchronized by knowing the selected joint policy tree.

Fig. 4. The Proposed Method (inputs: time horizon $T$, initial belief state $b^0$, number of trials; stages: First Phase, then Second Phase; output)

As the horizon length is incremented by one, the number
of leaves in the decision tree increases exponentially. Therefore,
storing the trees requires a large amount of memory. We assume
that SUs transmit for a small number of slots, for which storing
decision trees is feasible. This assumption is reasonable in
cases where the statistical behavior of the channels (e.g. the
transition probabilities of the Markov models) changes frequently
and using a joint policy for a very large horizon is not sensible.
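The memory growth mentioned above can be quantified with a short calculation. The sketch assumes a complete policy tree in which every node branches on the same number of observations; the function names and the choice of 3 observations per node are illustrative assumptions.

```python
def policy_tree_nodes(num_obs: int, horizon: int) -> int:
    """Nodes in a complete policy tree of the given depth (one action per node)."""
    return sum(num_obs ** level for level in range(horizon))


def policy_tree_leaves(num_obs: int, horizon: int) -> int:
    """Leaves of a complete policy tree: one per observation history of length horizon - 1."""
    return num_obs ** (horizon - 1)


# Assumed branching factor of 3 observations per node.
nodes_t5 = policy_tree_nodes(3, 5)
nodes_t10 = policy_tree_nodes(3, 10)
```

With this branching factor, a depth-5 tree has 121 nodes while a depth-10 tree already has 29524, and each SU must store one such tree per stored joint policy.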

V. Simulation 

To evaluate the performance of the proposed method, we
consider two SUs in a two-channel PUs' network. Each SU
senses one of the two channels in each time slot. The transition
probabilities of the channels are $[p^1_{01}, p^1_{10}, p^2_{01}, p^2_{10}] =
[0.15, 0.95, 0.95, 0.15]$. With these transition probabilities,
channels 1 and 2 do not have the same steady-state probabilities;
channel 1 is busy most of the time. It is also assumed that $S_1 = 1$
and $S_2 = 0$ in the first slot, which determines the initial
belief for the MBDP algorithm. Furthermore, each SU does
not send and waits for the next time slot if the observed channel
is occupied by PUs. If the channel is idle, the SU sends.
Besides, both SUs have perfect sensing capability. Under these
assumptions, we get a decision tree. An example of this
tree is shown in Fig. 5(a). $L_i$ denotes the action of sensing
channel $i$. $\{F, N\}$ denotes that the observed channel is free
and the transmission is successful. $\{F, C\}$ denotes that the observed
channel is free but a collision with the other SU happens during
transmission. $\{U\}$ means that the observed channel is busy and
the SU waits for the next time slot. An example of a joint policy
tree for $T = 5$ and $\alpha_1/\alpha_2 = 1$ is illustrated in Fig. 5. The
observations on the arcs leaving the roots of the subtrees are the
same as those at the original root; they are omitted to simplify
the illustration.
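The channel statistics used in this scenario can be checked with a small belief-propagation sketch. It rests on an assumed reading of the four transition probabilities as (busy-to-idle, idle-to-busy) pairs for channel 1 and channel 2, with state 1 meaning idle and the first slot starting with channel 1 idle and channel 2 busy; the function and variable names are hypothetical.

```python
def propagate(idle_prob: float, p01: float, p10: float) -> float:
    """One-slot update of P(channel idle) for a two-state Markov channel.

    p01 = P(busy -> idle), p10 = P(idle -> busy); state 1 means idle.
    """
    return idle_prob * (1.0 - p10) + (1.0 - idle_prob) * p01


def stationary_idle(p01: float, p10: float) -> float:
    """Steady-state probability that the channel is idle."""
    return p01 / (p01 + p10)


# Assumed reading of the simulation parameters: (p01, p10) per channel.
channels = {1: (0.15, 0.95), 2: (0.95, 0.15)}
belief = {1: 1.0, 2: 0.0}  # channel 1 idle, channel 2 busy in the first slot

history = [dict(belief)]
for _ in range(4):  # beliefs for slots 2..5 when no new observation is used
    belief = {c: propagate(belief[c], *channels[c]) for c in channels}
    history.append(dict(belief))
```

Under this reading, channel 1 is idle only about 14% of the time in steady state, and the two idle probabilities sum to one in every slot, so on average one transmission opportunity exists per slot.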

The proposed method is compared with two related works on the
multi-SU scheme [9], [10]. A multiuser heuristic (MH) policy
for sensing channels is proposed in [9]. Specifically, for
SU $i$ with belief vector on the availability of channels $\Omega^{(i)}(t) =
[\omega_1^{(i)}(t), \omega_2^{(i)}(t)]$, the probability $P_n(t)$ of choosing channel $n$
in slot $t$ is given by:

$$P_n(t) = \frac{\omega_n^{(i)}(t)}{\sum_{k=1}^{2} \omega_k^{(i)}(t)} \qquad (12)$$

In the other work [10], it is assumed that SUs have different
spectrum opportunities. A cooperation strategy which achieves
near-optimal throughput is given in [10]. In this strategy, SUs
exchange their belief vectors and use this information to take
an action. If we rewrite the cooperative approach for the case in
which SUs have the same spectrum opportunities, the sensing strategy
would be as follows:

$$\begin{cases} \text{SU1}: L_1,\ \text{SU2}: L_2 & \text{if } \omega_1^{(1)}(t) + \omega_2^{(2)}(t) > \omega_1^{(2)}(t) + \omega_2^{(1)}(t) \\ \text{SU1}: L_2,\ \text{SU2}: L_1 & \text{otherwise} \end{cases} \qquad (13)$$

Fig. 5. Joint policy tree: (a) SU1, (b) SU2
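The cooperative sensing rule can be sketched as a comparison of two channel assignments. The sketch assumes the rule compares the summed idle beliefs of the "straight" assignment (SU1 on channel 1, SU2 on channel 2) with those of the "crossed" assignment; the function name and the list encoding of beliefs are hypothetical.

```python
from typing import Sequence, Tuple


def cooperative_sensing(belief_su1: Sequence[float],
                        belief_su2: Sequence[float]) -> Tuple[str, str]:
    """Channel each SU senses under the cooperative rule (two SUs, two channels).

    belief_suX[n] is SU X's belief that channel n+1 is idle.
    """
    straight = belief_su1[0] + belief_su2[1]  # SU1 -> channel 1, SU2 -> channel 2
    crossed = belief_su1[1] + belief_su2[0]   # SU1 -> channel 2, SU2 -> channel 1
    return ("L1", "L2") if straight > crossed else ("L2", "L1")


# With identical beliefs the two sums are equal, so the "otherwise" branch applies.
same_view = cooperative_sensing([0.05, 0.95], [0.05, 0.95])
different_view = cooperative_sensing([0.9, 0.2], [0.3, 0.8])
```

When both SUs hold the same beliefs, the comparison always ties and the fallback assignment is used in every slot, which pins each SU to one channel regardless of how the beliefs evolve.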

To find optimum or sub-optimum policies, we run the
MBDP algorithm 30 times for each horizon length ($T$). The
parameter maxTree is set to three. The resulting network
throughput is shown in Fig. 6. The results are normalized
to the maximum achievable throughput of the network. This figure
demonstrates that the proposed method outperforms [9], [10].
We also consider the case $\alpha_1 = \alpha_2$. Parameter $\zeta$ is 0.25 in
this simulation. The results for these settings are shown in
Fig. 7 for both the proposed method and the cooperative approach. In
the cooperative approach, the second SU starves.
However, the SUs have approximately fair throughput under the
proposed method. The expected throughputs of the two SUs for
different QoS constraints are presented in Table II. Parameter
$\zeta$ is set to 0.25. The pair in each entry denotes the throughput
of SU1 and SU2, respectively. Numerical results show that the
proposed method guarantees the QoS constraints.
Fig. 7. SUs' Throughput (Fairness Comparison)



TABLE II
Achieved Throughput for Simulated QoS Constraints

Horizon length (T)    $\alpha_1/\alpha_2 = 1$    $\alpha_1/\alpha_2 = 1.5$    $\alpha_1/\alpha_2 = 2$
4                     2, 2                       2.55, 1.77                   2.66, 1.34
5                     2.6150, 2.385              3, 2                         3.5, 1.5
6                     3.03, 2.97                 3.7275, 2.39                 4, 2
7                     3.645, 3.355               4.32, 2.68                   4.56, 2.44
8                     4.02, 3.98                 4.9, 3.1                     5.44, 2.65
9                     4.6, 4.4                   5.5, 3.5                     5.86, 3.14
10                    5.03, 4.97                 5.92, 4.08                   6.62, 3.38


VI. Conclusion

We proposed a new method to guarantee QoS requirements
for sensing-based protocols in multi-SU networks.
By employing the MBDP algorithm, a set of optimum or
sub-optimum joint policies is found, and the one that satisfies
the QoS constraints is selected as the joint sensing strategy.
Besides, the proposed method ensures synchronization of SU
transceivers. For two SUs in a two-channel PUs' network,
simulations demonstrated that the proposed method achieves
maximum throughput. Moreover, the results show that the proposed
method guarantees different QoS constraints.